Final Project
James Adams
Group the rankings by `race_year_id` to count runners, and use those counts to fill in missing participation numbers:

```python
# races that report zero participants
zero_participant_races = races[races.participants == 0].race_year_id.unique()

# count how many ranked runners each of those races actually had
participant_counts = dict(
    rankings[rankings.race_year_id.isin(zero_participant_races)]
    .groupby('race_year_id').runner.count()
)

# back-fill the counts into the detailed results
for k, v in participant_counts.items():
    detailed_results['participants'].where(
        ~(detailed_results.race_year_id == k), other=v, inplace=True
    )
```

Can you predict the finishing time of a given athlete profile for a given race?
| | city | distance | elevation_gain | elevation_loss | aid_stations | participants |
|---|---|---|---|---|---|---|
| 0 | Castleton | 166.90 | 4520 | -4520 | 10 | 150 |
| 1 | Castleton | 166.90 | 4520 | -4520 | 10 | 150 |
| 2 | Castleton | 166.90 | 4520 | -4520 | 10 | 150 |
| 3 | Castleton | 166.90 | 4520 | -4520 | 10 | 150 |
| 4 | Castleton | 166.90 | 4520 | -4520 | 10 | 150 |
| | time | runner_age | distance | elevation_gain | elevation_loss | aid_stations | participants | runner_gender | runner_nationality_AND | runner_nationality_ARG | ... | city_Yibin | city_Yichang | city_Ystad | city_Zagreb | city_Zalesie | city_Zhaotong | city_Äkäslompolo | city_Åsa | city_Örebro | city_İstanbul |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95725.00 | 30 | 166.90 | 4520 | -4520 | 10 | 150 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 97229.00 | 43 | 166.90 | 4520 | -4520 | 10 | 150 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 103747.00 | 38 | 166.90 | 4520 | -4520 | 10 | 150 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 111217.00 | 55 | 166.90 | 4520 | -4520 | 10 | 150 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 117981.00 | 48 | 166.90 | 4520 | -4520 | 10 | 150 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 430 columns
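The 430-column design matrix above presumably comes from one-hot encoding the categorical columns (the `runner_nationality_*` and `city_*` prefixes match the `pd.get_dummies` naming convention). A minimal sketch of that step, using a toy frame with illustrative values:

```python
import pandas as pd

# toy stand-in for the merged race/runner data (values are illustrative)
df = pd.DataFrame({
    'time': [95725.0, 97229.0, 103747.0],
    'runner_age': [30, 43, 38],
    'runner_nationality': ['GBR', 'ARG', 'GBR'],
    'city': ['Castleton', 'Zagreb', 'Castleton'],
})

# one-hot encode the categoricals; each category becomes a 0/1 column
data_dm = pd.get_dummies(df, columns=['runner_nationality', 'city'], dtype=int)
print(data_dm.columns.tolist())
```

On the full dataset, hundreds of distinct cities and nationalities expand into the 430 columns shown above.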
Use `cross_validate` with the training data to obtain model metrics:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

# split off the target and hold out a test set
X = data_dm.drop(columns='time')
y = data_dm.time
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2022)

lr = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=2022)
val_scores = cross_validate(lr, X_train, y_train, cv=kf,
                            scoring=('r2', 'neg_mean_squared_error'),
                            return_train_score=True)
```
```
----- Cross Validation Results -----
Train RMSE: 18669.24176487618
Train RMSE as hours: 5.185900490243384
Train R2: 0.7445142195526304
Test RMSE: 6292082509.582213
Test RMSE as hours: 1747800.6971061705
Test R2: -56745276872.53794
-------------------------------------
```

The training folds look reasonable, but the enormous validation RMSE and hugely negative R² signal an unstable fit: with 429 features, most of them sparse dummy columns, some folds contain categories that are absent or nearly constant in their training split, and ordinary least squares extrapolates wildly on those columns. This motivates feature selection.
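The formatted summaries in this report can be produced directly from the dictionary that `cross_validate` returns. A sketch, assuming finishing times are stored in seconds (so dividing RMSE by 3600 gives hours, which matches the numbers above) and using a hypothetical `print_cv_results` helper:

```python
import numpy as np

def print_cv_results(val_scores):
    # cross_validate reports negated MSE, so flip the sign before the sqrt
    train_rmse = np.sqrt(-val_scores['train_neg_mean_squared_error']).mean()
    test_rmse = np.sqrt(-val_scores['test_neg_mean_squared_error']).mean()
    print('----- Cross Validation Results -----')
    print(f'Train RMSE: {train_rmse}')
    print(f'Train RMSE as hours: {train_rmse / 3600}')
    print(f"Train R2: {val_scores['train_r2'].mean()}")
    print(f'Test RMSE: {test_rmse}')
    print(f'Test RMSE as hours: {test_rmse / 3600}')
    print(f"Test R2: {val_scores['test_r2'].mean()}")
    print('-------------------------------------')
```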
Use `SelectKBest` to find the best variables to include in the model:

```python
import numpy as np
from sklearn import metrics
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

# create dictionaries to store results
scores = {}
rmses = {}

# loop through all the possible numbers of variables included in the model,
# fit each one using a pipeline with SelectKBest and a Linear Regression model,
# and store the results in the dictionaries
for n in range(1, 430):
    lr_selected = make_pipeline(SelectKBest(f_regression, k=n), LinearRegression())
    lr_selected.fit(X_train, y_train)
    scores[str(n)] = lr_selected.score(X_test, y_test)
    rmses[str(n)] = np.sqrt(metrics.mean_squared_error(y_test, lr_selected.predict(X_test)))
```

k = 406 scored best in the sweep, so refit and evaluate with that setting:

```python
lr_model = make_pipeline(SelectKBest(f_regression, k=406), LinearRegression())
new_val_scores = cross_validate(lr_model, X_train, y_train, cv=kf,
                                scoring=('r2', 'neg_mean_squared_error'),
                                return_train_score=True)
lr_model.fit(X_train, y_train)

# calculate R2
lr_model.score(X_test, y_test)

# calculate RMSE
np.sqrt(metrics.mean_squared_error(y_test, lr_model.predict(X_test)))
```
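The choice of k = 406 would come from inspecting the sweep results; a sketch of pulling the best k out of the RMSE dictionary (toy values here, so as not to restate the full sweep):

```python
# with the real sweep, the dict maps str(k) -> held-out RMSE; toy values here
rmses_demo = {'405': 18990.2, '406': 18952.05, '407': 18975.8}

# the best k is the key with the smallest RMSE
best_k = min(rmses_demo, key=rmses_demo.get)
print(best_k, rmses_demo[best_k])
```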
```
----- Cross Validation Results -----
Train RMSE: 18671.68085122933
Train RMSE as hours: 5.186578014230369
Train R2: 0.7444475485541255
Test RMSE: 18780.716818056364
Test RMSE as hours: 5.216865782793435
Test R2: 0.7413891601348425
-------------------------------------
```
```
----- Final Model Results -----
R2: 0.74
RMSE: 18952.05
RMSE in hrs: 5.26
-------------------------------
```

```
----- Predicted Outcome -----
For a 20 year old male from GBR, running a 155 km race in Zagreb
with an elevation gain of 600 ft, 100 other runners, and 10 aid stations.
Predicted finishing time: 1 days 03:00:13.388959575
-----------------------------
```
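The prediction readouts come from a helper (such as `rdg_predict_new_runner` used later) whose definition isn't shown in this excerpt. One hypothetical sketch of how such a helper could work: build a one-row frame aligned to the training columns, flip on the matching dummy columns, predict, and format the predicted seconds as a timedelta. The function name, argument order, and column names are all assumptions:

```python
import pandas as pd

def predict_new_runner(model, X_train, age, gender, nationality, distance,
                       gain, loss, aid_stations, participants, city):
    # start from an all-zero row with the same columns as the training data
    row = pd.DataFrame(0, index=[0], columns=X_train.columns)
    row.loc[0, ['runner_age', 'distance', 'elevation_gain', 'elevation_loss',
                'aid_stations', 'participants', 'runner_gender']] = [
        age, distance, gain, loss, aid_stations, participants, gender]
    # flip on the matching dummy columns if they exist in the training data
    for col in (f'runner_nationality_{nationality}', f'city_{city}'):
        if col in row.columns:
            row.loc[0, col] = 1
    # model output is in seconds; format it like "1 days 03:00:13..."
    seconds = model.predict(row)[0]
    return pd.to_timedelta(seconds, unit='s')
```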
As an alternative to trimming features with `SelectKBest`, try Ridge regression, which keeps every column but shrinks the coefficients:

```python
from sklearn.linear_model import Ridge

# instantiate Ridge Regression object
rdg = Ridge()

# perform cross validation for the Ridge Regression model
rdg_val_scores = cross_validate(rdg, X_train, y_train, cv=kf,
                                scoring=('r2', 'neg_mean_squared_error'),
                                return_train_score=True)

# fit and score the model
rdg.fit(X_train, y_train)
rdg.score(X_test, y_test)

# enter new data for a prediction from the Ridge model
rdg_predict_new_runner(20, 0, "GBR", 155, 1000, 400, 10, 100, "Zagreb")
```
```
----- Cross Validation Results -----
Train RMSE: 18678.84421521424
Train RMSE as hours: 5.1885678375595115
Train R2: 0.7442513730579005
Test RMSE: 18780.14212984383
Test RMSE as hours: 5.216706147178841
Test R2: 0.74140731798193
-------------------------------------
```

```
----- Predicted Outcome -----
For a 20 year old male from GBR, running a 155 km race in Zagreb
with an elevation gain of 600 ft, 100 other runners, and 10 aid stations.
Predicted finishing time: 1 days 02:54:09.802431778
-----------------------------
```